Abstract: In this paper, we propose a maximum margin
classifier that deals with uncertainty in data input. Specifically,
we reformulate the SVM framework such that each input training
entity is not solely a feature vector representation, but a multi-dimensional
Gaussian distribution with given probability density,
i.e., with a given mean and covariance matrix. The latter 
expresses the uncertainty. We arrive at a convex optimization 
problem, which is solved in the primal form using a gradient 
descent approach. The resulting classifier, which we name SVM 
with Gaussian Sample Uncertainty (SVM-GSU), is tested on 
synthetic data, as well as on the problem of event detection in 
video using the large-scale TRECVID MED 2014 dataset, and 
the problem of image classification using the MNIST dataset of 
handwritten digits. Experimental results verify the effectiveness 
of the proposed classifier. 

Index Terms: Classification, convex optimization, Gaussian
anisotropic uncertainty, large margin methods, learning with 
uncertainty, statistical learning theory 


I. Introduction 

The Support Vector Machine (SVM) has been shown to
be a powerful paradigm for pattern classification. The
origins of SVM can be traced back to [1], [2]. In [3], Vapnik
established the standard regularized SVM algorithm where a
linear discriminative function is computed in order to achieve
maximum sample margin. To this end, a penalty term approximating
the total training error is considered along with a
regularization term, typically chosen as a norm of the classifier,
in order to avoid the so-called over-fitting phenomenon. From
a statistical learning theory point of view, this is interpreted as 
follows: the regularization term restricts the complexity of the 
classifier and thus the deviation of the testing error. Hence, the 
training error is controlled (see e.g. [4], [5], [6]). The training 
data are assumed to be drawn from some unknown probability 
distribution; specifically, they are assumed to be independently 
drawn and identically distributed (“iid”). 

C. Tzelepis is with the Information Technologies Institute/Centre for
Research and Technology Hellas (CERTH), Thermi 57001, Greece, and also
with the School of Electronic Engineering and Computer Science, Queen
Mary University of London, London E1 4NS, U.K. (email: tzelepis@iti.gr).

V. Mezaris is with the Information Technologies Institute/Centre for
Research and Technology Hellas (CERTH), Thermi 57001, Greece (email:
bmezaris@iti.gr).

I. Patras is with the School of Electronic Engineering and Computer
Science, Queen Mary University of London, London E1 4NS, U.K. (e-mail:
i.patras@qmul.ac.uk).

Fig. 1: Linear SVM with Gaussian Sample Uncertainty (SVM-GSU).
The solid line illustrates the decision boundary of the proposed
algorithm, and the dashed line shows the decision boundary
of the standard linear SVM.

The majority of the classification methods do not address
the uncertainty in the training data explicitly. That is, each
training sample is described by its position in some vector
space (feature representation). However, such an approach
often does not express the true underlying process of extracting
the feature representation. Errors are often introduced during 
sensing or feature extraction and therefore the training data 
are noisy. In this work, we model the uncertainty of each 
training example using a multivariate Gaussian distribution, 
such that the covariance matrix of each distribution is treated 
as a measure of this uncertainty. That is, we model each input 
example as a random vector following a multivariate Gaussian 
distribution with given mean vector and covariance matrix. 
In Fig. 1 we can see such 2D training examples, given as
bivariate Gaussian distributions with certain mean vectors and 
covariance matrices. For the sake of visualization, we illustrate 
the uncertainty of each input training vector with the shaded 
regions, which are bounded by the iso-density loci of points 
(ellipses) described by the 0.03% of the maximum density 
of each distribution. A novel SVM formulation is developed, 
by modifying appropriately the mechanism for measuring the 
classification (empirical) error and for taking it into account 
during training. Hereafter, the proposed algorithm will be 
called SVM with Gaussian Sample Uncertainty (SVM-GSU). 
The toy example in Fig. 1 illustrates the motivation behind
the proposed SVM-GSU. That is, the decision boundary of
the SVM-GSU, shown with a solid line, may be drastically
different from that of the standard SVM, shown with a dashed
line, when taking into account the uncertainty associated with
each input datum.

The remainder of this paper is organized as follows. In
Section II, we review related work. In Section III, we present
the proposed SVM-GSU. In Section IV, we provide the experimental
results of the application of SVM-GSU to synthetic
data, the TRECVID MED 2014 dataset, as well as the MNIST
dataset, along with comparisons with the standard SVM and
other state-of-the-art methods. We discuss conclusions in
Section V.

II. Related Work 

Assuming uncertainty in input under the SVM paradigm 
is not new. Different types of Robust SVMs have been 
proposed in several recent works. Bi and Zhang [7] considered 
a statistical formulation where the input noise is modeled 
as a hidden mixture component, but in this way the “iid” 
assumption for the training data is violated. In that work, 
the uncertainty is modeled isotropically. Second order cone 
programming (SOCP) [8] methods have also been employed 
in numerous works to handle missing and uncertain data. In
addition, Robust Optimization [9], [10] techniques have been
proposed for optimization problems where the data is not specified
exactly, but is known to belong to a given uncertainty set
$\mathcal{U}$, yet the constraints of the optimization problem must hold
for all possible values of the data from $\mathcal{U}$.

Lanckriet et al. [11] considered a binary classification 
problem where the mean and covariance matrix of each class 
are assumed to be known. Then, a minimax problem is 
formulated such that the worst-case (maximum) probability of 
misclassification of future data points is minimized. That is, 
under all possible choices of class-conditional densities with a 
given mean and covariance matrix, the worst-case probability 
of misclassification of new data is minimized. To do so,
the authors exploited generalized Chebyshev inequalities [12] 
and particularly a theorem according to which the probability 
of misclassifying a point is bounded. 

Shivaswamy et al. [13], who extended Bhattacharyya et 
al. [14], also adopted a second order cone programming 
formulation and used generalized Chebyshev inequalities to 
design robust classifiers dealing with uncertain observations. 
Then uncertainty arises in ellipsoidal form, as follows directly 
from the multivariate Chebyshev inequality. This formulation 
achieves robustness by requiring that the ellipsoid of every 
uncertain data point should lie in the correct half-space. The 
expected error of misclassifying a sample is obtained by 
computing the volume of the ellipsoid that lies on the wrong 
side of the hyperplane. However, this quantity is not computed 
analytically; instead, a large number of uniformly distributed 
points are generated in the ellipsoid, and the fraction of the 
number of points on the wrong side of the hyperplane to the 
total number of generated points is computed. 

Xu et al. [15], [16] considered the robust classification
problem for a class of non-box-typed uncertainty sets, in
contrast to [14], [13], [11], who robustified regularized classification
using box-type uncertainty. That is, they considered a
setup where the joint uncertainty is the Cartesian product of
uncertainty in each input, leading to penalty terms on each
constraint of the resulting formulation. Furthermore, Xu et
al. gave evidence on the equivalence between the standard


regularized SVM and this robust optimization formulation, 
establishing robustness as the reason why regularized SVMs 
generalize well. 

In [17], motivated by GEPSVM [18], Qi et al. robustified
a twin support vector machine (TWSVM) [19]. Robust
TWSVM deals with data affected by measurement noise using 
a second order cone programming formulation. In their work, 
input data is contaminated with isotropic noise (i.e., spherical 
disturbances centred at the training samples), and thus cannot 
model real-world uncertainty, which is typically described by 
more complex noise patterns. Our proposed classifier, which is 
presented below, does not violate the “iid” assumption for the 
training input data, while it can model the uncertainty of each 
input training example using an arbitrary covariance matrix, 
consequently permitting the uncertainty to be anisotropic. 
Moreover, the expected error is computed analytically and is 
minimized by an iterative gradient descent algorithm whose 
complexity is linear with respect to the number of training 
data. Finally, we apply a linear subspace learning approach in
order to solve the problem in lower-dimensional spaces, and 
thus accelerate the training stage. Learning in subspaces is 
widely used in various statistical learning problems [20], [21], 
[22], [23]. 

III. Proposed Approach 

As discussed above, in this section we develop a new
algorithm, in which the training set that feeds the proposed
classifier includes training examples described not solely by a
set of feature representations, i.e. a set of vectors in some
n-dimensional space, but rather by a set of multivariate Gaussian
distributions; that is, every training datum is characterized by a
mean vector $\mathbf{x}_i \in \mathcal{D}$ and a covariance matrix $\Sigma_i \in \mathbb{S}_{++}^n$.¹
A linear formulation is proposed below, while an approximation
formulation dealing with learning in linear subspaces is
discussed next.

A. SVM with Gaussian Sample Uncertainty (SVM-GSU) 

Let us briefly begin with the baseline SVM algorithm, which
provides the arguments necessary for generalizing and
proceeding to the proposed approach. We consider the supervised
learning framework where a set of $\ell$ annotated observations
is available. That is, each observation consists of a vector,
$\mathbf{x}_i$, in some n-dimensional vector space, let $\mathcal{D} \subseteq \mathbb{R}^n$,² and an
associated label, $y_i \in \{\pm 1\}$. Let us denote the training set by
$\mathcal{X} = \{(\mathbf{x}_i, y_i) : \mathbf{x}_i \in \mathbb{R}^n, y_i \in \{\pm 1\}, i = 1, \dots, \ell\}$. Then, the
baseline linear SVM [3] learns a hyperplane $\mathcal{H}: \mathbf{w} \cdot \mathbf{x} + b = 0$
that minimizes with respect to $\mathbf{w}$, $b$ the following objective
function

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \max\big(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\big), \tag{1}$$

where $h(y, t) = \max(0, 1 - yt)$ is known as the “hinge loss”
function [24].
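As a minimal sketch of objective (1), the following Python fragment evaluates the regularized hinge-loss objective for a given hyperplane. The names `hinge`, `dot` and `svm_objective`, and the representation of samples as `(x, y)` pairs of plain lists and ±1 labels, are illustrative assumptions and are not part of the original implementation.

```python
def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge(y, t):
    """Hinge loss h(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)


def svm_objective(w, b, samples, C=1.0):
    """Objective (1): 0.5 ||w||^2 + C * sum_i hinge(y_i, w . x_i + b).

    samples is a list of (x, y) pairs with y in {+1, -1}.
    """
    regularizer = 0.5 * dot(w, w)
    total_loss = sum(hinge(y, dot(w, x) + b) for x, y in samples)
    return regularizer + C * total_loss
```

A sample that lies on the correct side of the margin contributes zero loss, so only margin violators affect the sum.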

¹ $\mathcal{D}$ is typically a subset of the n-dimensional Euclidean space of column
vectors, while $\mathbb{S}_{++}^n$ denotes the convex cone of all symmetric positive definite
n × n matrices with real entries.

² For the rest of this paper, we will assume that $\mathcal{D} = \mathbb{R}^n$.
In this work, we assume that instead of the i-th training
example, we are given a multivariate Gaussian distribution
with mean vector $\mathbf{x}_i$ and covariance matrix $\Sigma_i$. One can
think of the covariance matrix $\Sigma_i$ as describing the
uncertainty about the position of the training sample around
$\mathbf{x}_i$. Formally, we define random variables $\mathbf{X}_i$, each of which
follows an n-dimensional Gaussian distribution with mean
vector $\mathbf{x}_i \in \mathbb{R}^n$ and covariance matrix $\Sigma_i$, a symmetric positive
definite n × n matrix, $\Sigma_i \in \mathbb{S}_{++}^n$. The probability density
function (pdf) of the i-th Gaussian distribution is given by
$f_{\mathbf{X}_i}: \mathbb{R}^n \to \mathbb{R}$, with

$$f_{\mathbf{X}_i}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{x}_i)^\top \Sigma_i^{-1} (\mathbf{x} - \mathbf{x}_i) \right). \tag{2}$$

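Adopting the above assumption for the input training vectors,
we can express the training set as a set of $\ell$ annotated
Gaussian distributions, i.e., $\mathcal{X}' = \{(\mathbf{x}_i, \Sigma_i, y_i) : \mathbf{x}_i \in \mathbb{R}^n, \Sigma_i \in \mathbb{S}_{++}^n, y_i \in \{\pm 1\}, i = 1, \dots, \ell\}$.
The optimization problem, in
its unconstrained form, is then formulated as follows

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \int_{\mathbb{R}^n} \max\big(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x} + b)\big) f_{\mathbf{X}_i}(\mathbf{x}) \, d\mathbf{x}, \tag{3}$$

or,

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \int_{\Omega_i} \big[1 - y_i(\mathbf{w} \cdot \mathbf{x} + b)\big] f_{\mathbf{X}_i}(\mathbf{x}) \, d\mathbf{x}, \tag{4}$$

where $\Omega_i$ denotes the half-space of $\mathbb{R}^n$ that is defined by the
hyperplane $\mathcal{H}': y_i(\mathbf{w} \cdot \mathbf{x} + b) = 1$ as $\Omega_i = \{\mathbf{x} \in \mathbb{R}^n : y_i(\mathbf{w} \cdot \mathbf{x} + b) \le 1\}$,
and is the half-space in which a misclassified
sample lies.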

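Note that the loss function $\mathcal{L}: (\mathbb{R}^n \times \mathbb{R}) \times (\mathbb{R}^n \times \mathbb{S}_{++}^n \times \{\pm 1\}) \to \mathbb{R}$
that can be defined for the samples drawn from
the i-th Gaussian, that is,

$$\mathcal{L}(\mathbf{w}, b, \mathbf{x}_i, \Sigma_i, y_i) = \int_{\Omega_i} \big[1 - y_i(\mathbf{w} \cdot \mathbf{x} + b)\big] f_{\mathbf{X}_i}(\mathbf{x}) \, d\mathbf{x}, \tag{5}$$

is the expected value of the hinge loss. Using Theorem
1, proved in Appendix A, for the half-spaces $\Omega_i^{+} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{w} \cdot \mathbf{x} + b - 1 \ge 0\}$
and $\Omega_i^{-} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{w} \cdot \mathbf{x} + b - 1 \le 0\}$,
the above integral is evaluated in terms of $\mathbf{w}$ and $b$ as follows

$$\mathcal{L}(\mathbf{w}, b, \mathbf{x}_i, \Sigma_i, y_i) = \frac{1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)}{2} \left[ y_i \operatorname{erf}\left( \frac{y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)}{\sqrt{2 \mathbf{w}^\top \Sigma_i \mathbf{w}}} \right) + 1 \right] + \frac{\sqrt{2 \mathbf{w}^\top \Sigma_i \mathbf{w}}}{2\sqrt{\pi}} \exp\left( -\frac{[y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)]^2}{2 \mathbf{w}^\top \Sigma_i \mathbf{w}} \right), \tag{6}$$

where $\operatorname{erf}: \mathbb{R} \to (-1, 1)$ is the error function, defined as

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt.$$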


As stated above, the covariance matrix of each training random
vector describes its uncertainty, and as the covariance matrix
approaches the zero matrix, the certainty increases. At the
extreme, as $\Sigma_i \to 0$, the error function in (6) tends to ±1 and the
exponential term vanishes, so that (6) yields
$\max\big(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\big)$, which is the hinge loss function
used in the standard SVM formulation [3], [25], [24]. This
implies that the proposed formulation is a generalization of
the standard SVM; the two classifiers are equivalent when the
covariance matrices tend to the zero matrix³.
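The closed form (6) and its limiting behaviour can be illustrated with a short Python sketch. The function name `expected_hinge` and the list-of-rows representation of the covariance matrix are assumptions made for illustration; the fallback to the plain hinge loss for a zero variance is likewise a convenience of this sketch and not part of the original formulation.

```python
import math


def expected_hinge(w, b, mean, cov, y):
    """Expected hinge loss of a Gaussian sample N(mean, cov), following (6).

    w and mean are lists of length n, cov is a list of n rows, y is +1 or -1.
    """
    n = len(w)
    mu = sum(w[j] * mean[j] for j in range(n)) + b           # w . x_i + b
    var = sum(w[j] * cov[j][k] * w[k]                        # w^T Sigma_i w
              for j in range(n) for k in range(n))
    d = 1.0 - y * mu
    if var <= 0.0:
        # Degenerate case (no uncertainty along w): plain hinge loss.
        return max(0.0, d)
    r = y - mu
    erf_term = math.erf(r / math.sqrt(2.0 * var))
    exp_term = math.exp(-r * r / (2.0 * var))
    return (0.5 * d * (y * erf_term + 1.0)
            + math.sqrt(2.0 * var) / (2.0 * math.sqrt(math.pi)) * exp_term)
```

For a sample whose mean lies exactly on the margin, the hinge loss is zero but the expected hinge loss is strictly positive and grows with the standard deviation of the sample along w.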

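Let $\mathcal{J}: (\mathbb{R}^n \times \mathbb{R}) \times (\mathbb{R}^n \times \mathbb{S}_{++}^n \times \{\pm 1\}) \to \mathbb{R}$ be the
objective function of the SVM-GSU formulation, i.e.,

$$\mathcal{J}(\mathbf{w}, b, \mathbf{x}_i, \Sigma_i, y_i) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \mathcal{L}(\mathbf{w}, b, \mathbf{x}_i, \Sigma_i, y_i), \tag{7}$$

which is convex as proved in Appendix B.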

To solve the convex optimization problem (4), the Limited-memory
BFGS (L-BFGS) algorithm has been employed⁴.
L-BFGS belongs to the family of quasi-Newton methods and
approximates the BFGS algorithm [26] using a limited amount 
of memory. L-BFGS requires the first-order derivatives with 
respect to the optimization variables w, b. Then, the objective 
function is minimized jointly for w, b and a (global) optimal 
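solution is achieved. By differentiating $\mathcal{J}$ with respect to $\mathbf{w}$
and $b$, we obtain, respectively,

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{J}(\mathbf{w}, b) = \mathbf{w} + C \sum_{i=1}^{\ell} \left[ \frac{\exp\left( -\frac{[y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)]^2}{2 \mathbf{w}^\top \Sigma_i \mathbf{w}} \right)}{\sqrt{2\pi \mathbf{w}^\top \Sigma_i \mathbf{w}}} \, \Sigma_i \mathbf{w} - \frac{1}{2} \left( \operatorname{erf}\left( \frac{y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)}{\sqrt{2 \mathbf{w}^\top \Sigma_i \mathbf{w}}} \right) + y_i \right) \mathbf{x}_i \right], \tag{8}$$

$$\frac{\partial}{\partial b} \mathcal{J}(\mathbf{w}, b) = -\frac{C}{2} \sum_{i=1}^{\ell} \left[ \operatorname{erf}\left( \frac{y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)}{\sqrt{2 \mathbf{w}^\top \Sigma_i \mathbf{w}}} \right) + y_i \right]. \tag{9}$$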

By applying L-BFGS on the problem of (4), we obtain the 
optimal values of the parameters w, b defining the SVM- 
GSU’s learned separating hyperplane. 
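The following Python sketch evaluates the objective (7) together with the derivatives (8) and (9), and minimizes it with plain fixed-step gradient descent. Note that this is a simplification for illustration: the paper uses L-BFGS, whereas the optimizer here, the step size, the iteration count, the starting point, the small variance guard, and all function names are assumptions of this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective_and_grad(w, b, data, C=1.0):
    """Return (J, dJ/dw, dJ/db) for data given as (mean, cov, y) triples."""
    J = 0.5 * dot(w, w)
    grad_w = list(w)                                 # gradient of 0.5 ||w||^2
    grad_b = 0.0
    for mean, cov, y in data:
        cov_w = [dot(row, w) for row in cov]         # Sigma_i w
        var = max(dot(w, cov_w), 1e-12)              # w^T Sigma_i w (guarded)
        margin = dot(w, mean) + b                    # w . x_i + b
        r = y - margin
        e = math.erf(r / math.sqrt(2.0 * var))
        g = math.exp(-r * r / (2.0 * var))
        d = 1.0 - y * margin
        J += C * (0.5 * d * (1.0 + y * e)
                  + math.sqrt(var / (2.0 * math.pi)) * g)
        coef = g / math.sqrt(2.0 * math.pi * var)
        for j in range(len(w)):
            grad_w[j] += C * (coef * cov_w[j] - 0.5 * (e + y) * mean[j])
        grad_b -= 0.5 * C * (e + y)
    return J, grad_w, grad_b


def gradient_descent(data, n, C=1.0, lr=0.02, iters=1000):
    """Fixed-step gradient descent on J (a stand-in for L-BFGS)."""
    w, b = [0.1] * n, 0.0        # nonzero start keeps w^T Sigma_i w > 0
    for _ in range(iters):
        _, gw, gb = objective_and_grad(w, b, data, C)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b
```

Because each iteration visits every training distribution once, the cost per iteration grows linearly with the number of training examples.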

Then, given this hyperplane $\mathcal{H}: \mathbf{w} \cdot \mathbf{x} + b = 0$, an unseen
testing datum, $\mathbf{x}_t$, is classified to one of the two classes
according to the sign of the (signed) distance between $\mathbf{x}_t$ and
the separating hyperplane. That is, the predicted label of $\mathbf{x}_t$ is
computed as $y_t = \operatorname{sgn}(d_t)$, where $d_t = (\mathbf{w} \cdot \mathbf{x}_t + b)/\|\mathbf{w}\|$, while
a probabilistic degree of confidence (DoC) that the testing
sample belongs to the class to which it has been classified
can be calculated using the well-known sigmoid function,
$S(d_t) = 1/(1 + e^{-d_t})$. This is the same approach that is used
in the baseline linear SVM formulation [27] for evaluating a
sample’s class membership at the testing phase.
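The testing rule can be sketched in a few lines of Python. The function name `predict` is hypothetical, and the sigmoid is taken in its standard form 1/(1 + exp(-d)), which is an assumption of this sketch; mapping a zero distance to the positive class is also an arbitrary choice made here.

```python
import math


def predict(w, b, x):
    """Return (label, signed distance, sigmoid score) for a test vector x."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    d = (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm   # signed distance
    label = 1 if d >= 0 else -1                             # sgn(d), ties -> +1
    score = 1.0 / (1.0 + math.exp(-d))                      # S(d)
    return label, d, score
```

The score is close to 1 for points far on the positive side of the hyperplane and close to 0 for points far on the negative side.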


B. Solving the SVM-GSU in linear subspaces 

Since learning in the original n-dimensional input space 
may introduce computationally expensive terms, in this section 
we propose a methodology for approximating the loss function 
of SVM-GSU, by projecting each input random vector into 
a linear subspace. The dimensionality of each subspace is 
defined by preserving a given fraction of the total variance 
for each covariance matrix. Then, the total loss, as well as
its first derivatives, are computed separately in each subspace.
A comprehensive analysis of the above method is discussed
below.

³ The zero covariance matrix exists due to the well-known property that the
set of symmetric positive definite matrices is a convex cone with vertex at
zero.

⁴ A framework for training and testing the linear SVM-GSU has been
developed in C and is publicly available at <withheld during reviewing>.

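By performing eigenanalysis of the covariance matrix of $\mathbf{X}_i$,
the latter is decomposed as follows

$$\Sigma_i = U_i \Lambda_i U_i^\top, \tag{10}$$

where $\Lambda_i$ is an n × n diagonal matrix consisting of the
eigenvalues of $\Sigma_i$, i.e. $\Lambda_i = \operatorname{diag}(\lambda_i^1, \dots, \lambda_i^n)$, such that
$\lambda_i^1 \ge \lambda_i^2 \ge \dots \ge \lambda_i^n > 0$, while $U_i$ is an n × n orthonormal matrix,
whose j-th column, $\mathbf{u}_i^j$, is the eigenvector corresponding to the
j-th eigenvalue, $\lambda_i^j$. Let us keep the first $d_i \le n$ eigenvectors,
such that a certain percentage $\epsilon$ (e.g. $\epsilon = 90\%$) of the total
variance is preserved, i.e.

$$\frac{\sum_{j=1}^{d_i} \lambda_i^j}{\sum_{j=1}^{n} \lambda_i^j} \ge \epsilon.$$

Then, we construct the n × $d_i$ matrix $U_i'$ by keeping the first
$d_i$ columns of $U_i$, i.e.,

$$U_i' = \big[\mathbf{u}_i^1 \; \mathbf{u}_i^2 \; \dots \; \mathbf{u}_i^{d_i}\big]. \tag{11}$$

Now, by using the matrix $P_i = U_i'^\top \in \mathbb{R}^{d_i \times n}$, we define a
new random vector $\mathbf{Z}_i$, such that

$$\mathbf{Z}_i = P_i \mathbf{X}_i. \tag{12}$$

Then, $\mathbf{Z}_i \in \mathbb{R}^{d_i}$ follows a multivariate Gaussian distribution
(since $\mathbf{X}_i \sim \mathcal{N}(\mathbf{x}_i, \Sigma_i)$), i.e. $\mathbf{Z}_i \sim \mathcal{N}(\mathbf{z}_i, \Sigma_i^z)$, with mean
vector

$$\mathbf{z}_i = \mathbb{E}[P_i \mathbf{X}_i] = P_i \mathbb{E}[\mathbf{X}_i] = P_i \mathbf{x}_i \in \mathbb{R}^{d_i} \tag{13}$$

and covariance matrix

$$\Sigma_i^z = P_i \Sigma_i P_i^\top = \Lambda_i^z = \operatorname{diag}(\lambda_i^1, \dots, \lambda_i^{d_i}). \tag{14}$$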


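The probability density function of $\mathbf{Z}_i$ is given by $f_{\mathbf{Z}_i}: \mathbb{R}^{d_i} \to \mathbb{R}$, with

$$f_{\mathbf{Z}_i}(\mathbf{z}) = \frac{1}{(2\pi)^{d_i/2} |\Sigma_i^z|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{z} - \mathbf{z}_i)^\top (\Sigma_i^z)^{-1} (\mathbf{z} - \mathbf{z}_i) \right). \tag{15}$$

$P_i$ is a projection matrix from $\mathbb{R}^n$ to the $d_i$-dimensional space $\mathbb{R}^{d_i}$.
Let us now see how the integral in (4) is approximated
in the new space. To this end, the following holds true

$$\mathbf{w} \cdot \mathbf{x} \approx \mathbf{w}^\top (P_i^\top \mathbf{z}) = (P_i^\top \mathbf{z})^\top \mathbf{w} = \mathbf{z}^\top P_i \mathbf{w} = (P_i \mathbf{w}) \cdot \mathbf{z},$$

or, by letting $\mathbf{w}_z = P_i \mathbf{w}$,

$$\mathbf{w} \cdot \mathbf{x} \approx \mathbf{w}_z \cdot \mathbf{z}.$$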

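Consequently, the integral in the RHS of (4) can be approximated
by the quantity

$$\int_{\Omega_i^z} \big[1 - y_i(\mathbf{w}_z \cdot \mathbf{z} + b)\big] f_{\mathbf{Z}_i}(\mathbf{z}) \, d\mathbf{z},$$

where $\Omega_i^z$ denotes the projected half-space on $\mathbb{R}^{d_i}$, that is,

$$\Omega_i^z = \big\{\mathbf{z} \in \mathbb{R}^{d_i} : y_i(\mathbf{w}_z \cdot \mathbf{z} + b) \le 1\big\}.$$

Using Theorem 1, which is proved in Appendix A, the above
integral is equal to

$$\frac{1 - y_i(\mathbf{w}_z \cdot \mathbf{z}_i + b)}{2} \left[ y_i \operatorname{erf}\left( \frac{y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)}{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}} \right) + 1 \right] + \frac{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}}{2\sqrt{\pi}} \exp\left( -\frac{[y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)]^2}{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z} \right). \tag{16}$$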


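Therefore, for each training example (i.e., for each random
vector that follows a given Gaussian distribution), the loss
function $\mathcal{L}^z: (\mathbb{R}^{d_i} \times \mathbb{R}) \times (\mathbb{R}^{d_i} \times \mathbb{S}_{++}^{d_i} \times \{\pm 1\}) \to \mathbb{R}$ is given by

$$\mathcal{L}^z(\mathbf{w}_z, b, \mathbf{z}_i, \Sigma_i^z, y_i) = \frac{1 - y_i(\mathbf{w}_z \cdot \mathbf{z}_i + b)}{2} \left[ y_i \operatorname{erf}\left( \frac{y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)}{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}} \right) + 1 \right] + \frac{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}}{2\sqrt{\pi}} \exp\left( -\frac{[y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)]^2}{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z} \right). \tag{17}$$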


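Therefore, the objective function $\mathcal{J}: \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$, given by
(7), can be approximated as follows

$$\mathcal{J}'(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} \mathcal{L}^z(P_i \mathbf{w}, b, \mathbf{z}_i, \Sigma_i^z, y_i). \tag{18}$$

Following similar arguments as in the case of learning in the
original space, $\mathcal{J}'$ can be shown to be convex (see Appendix
B).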

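The first derivative of $\mathcal{J}'$ with respect to $\mathbf{w}$ is given as
follows

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{J}'(\mathbf{w}, b) = \mathbf{w} + C \sum_{i=1}^{\ell} \frac{\partial}{\partial \mathbf{w}} \mathcal{L}^z(\mathbf{w}_z, b, \mathbf{z}_i, \Sigma_i^z, y_i). \tag{19}$$

Thus, by using the chain rule,

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{J}'(\mathbf{w}, b) = \mathbf{w} + C \sum_{i=1}^{\ell} \left( \frac{\partial \mathbf{w}_z}{\partial \mathbf{w}} \right)^{\!\top} \frac{\partial}{\partial \mathbf{w}_z} \mathcal{L}^z(\mathbf{w}_z, b, \mathbf{z}_i, \Sigma_i^z, y_i), \tag{20}$$

where

$$\frac{\partial \mathbf{w}_z}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} P_i \mathbf{w} = P_i.$$

By differentiating with respect to $\mathbf{w}_z$, (20) yields

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{J}'(\mathbf{w}, b) = \mathbf{w} + C \sum_{i=1}^{\ell} \left[ \frac{\exp\left( -\frac{[y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)]^2}{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z} \right)}{\sqrt{2\pi \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}} \, P_i^\top \Sigma_i^z \mathbf{w}_z - \frac{1}{2} \left( \operatorname{erf}\left( \frac{y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)}{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}} \right) + y_i \right) P_i^\top \mathbf{z}_i \right]. \tag{21}$$

Moreover, the first derivative of $\mathcal{J}'$ with respect to $b$ is given
as follows

$$\frac{\partial}{\partial b} \mathcal{J}'(\mathbf{w}, b) = -\frac{C}{2} \sum_{i=1}^{\ell} \left[ \operatorname{erf}\left( \frac{y_i - (\mathbf{w}_z \cdot \mathbf{z}_i + b)}{\sqrt{2 \mathbf{w}_z^\top \Sigma_i^z \mathbf{w}_z}} \right) + y_i \right], \tag{22}$$

where $\mathbf{w}_z = P_i \mathbf{w}$, $\Sigma_i^z = P_i \Sigma_i P_i^\top$.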
At the implementation level, for solving the SVM-GSU in
linear subspaces, the eigenanalysis of the covariance matrices
$\Sigma_i$ is performed only once for each Gaussian distribution,
before the optimization procedure begins. Consequently, the
following orthonormal matrices and vectors are computed
once:

• $U_i = [\mathbf{u}_i^1 \; \mathbf{u}_i^2 \; \dots \; \mathbf{u}_i^n] \in \mathbb{R}^{n \times n}$,
• $U_i' = [\mathbf{u}_i^1 \; \mathbf{u}_i^2 \; \dots \; \mathbf{u}_i^{d_i}] \in \mathbb{R}^{n \times d_i}$,
• $P_i = U_i'^\top \in \mathbb{R}^{d_i \times n}$,
• $\mathbf{z}_i = P_i \mathbf{x}_i$, and
• $\Sigma_i^z = P_i \Sigma_i P_i^\top = \Lambda_i^z \in \mathbb{S}_{++}^{d_i}$.

Then, for each iterate $(\mathbf{w}, b) \in \mathbb{R}^n \times \mathbb{R}$ and for each
training example (distribution), $(\mathbf{x}_i, \Sigma_i)$, the projected (normal
to the separating hyperplane) vector has to be computed:

$$\mathbf{w}_z = P_i \mathbf{w} \in \mathbb{R}^{d_i}.$$

Finally, the loss function is computed in the low-dimensional
spaces $\mathbb{R}^{d_i}$, $i = 1, \dots, \ell$, as shown in (17). The objective function
is computed as shown in (18), while its first derivatives
are computed as in (21), (22).
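The subspace procedure can be sketched in Python for the special case of diagonal covariance matrices. This restriction is an assumption of the sketch: for a diagonal matrix the eigenvalues are the diagonal entries and the eigenvectors are the coordinate axes, so the projection P_i simply selects coordinates and no general eigen-solver is needed. The function names and the default fraction are illustrative.

```python
import math


def top_components(eigenvalues, fraction=0.90):
    """Indices of the largest eigenvalues whose sum reaches the given
    fraction of the total variance (the choice of d_i)."""
    order = sorted(range(len(eigenvalues)),
                   key=lambda j: eigenvalues[j], reverse=True)
    target = fraction * sum(eigenvalues)
    kept, acc = [], 0.0
    for j in order:
        kept.append(j)
        acc += eigenvalues[j]
        if acc >= target:
            break
    return kept


def subspace_loss(w, b, mean, variances, y, fraction=0.90):
    """Loss (17) for a Gaussian with diagonal covariance diag(variances)."""
    kept = top_components(variances, fraction)
    w_z = [w[j] for j in kept]                 # w_z = P_i w
    z = [mean[j] for j in kept]                # z_i = P_i x_i
    mu = sum(a * c for a, c in zip(w_z, z)) + b
    var = sum(a * a * variances[j] for a, j in zip(w_z, kept))
    d = 1.0 - y * mu
    if var <= 0.0:
        return max(0.0, d)                     # no uncertainty along w_z
    r = y - mu
    return (0.5 * d * (1.0 + y * math.erf(r / math.sqrt(2.0 * var)))
            + math.sqrt(var / (2.0 * math.pi)) * math.exp(-r * r / (2.0 * var)))
```

Since the selection of components depends only on the covariance matrix, `top_components` would be evaluated once per training example before optimization, while the projection of w is repeated at every iteration.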

IV. Experiments 

The classification performance of the proposed algorithm is 
initially validated on 2D synthetic data, in order to illustrate 
how the linear SVM-GSU classifier works. To this end, we 
consider binary classification toy experiments and validate on 
them the proposed learning algorithm both in the original 
feature space, as well as in linear subspaces. 

Next, the proposed algorithm is applied to two different, 
challenging learning problems, i.e., the problem of complex 
event detection in video, and the problem of image classifi¬ 
cation of handwritten digits. The large video dataset of the 
TRECVID Multimedia Event Detection (MED) 2014 task is 
used for the event detection experiments (Sect. IV-B), while 
the well-known MNIST database of handwritten digits is used 
for the image classification ones (Sect. IV-C). For each of those
problem domains, a methodology for modeling the uncertainty 
of each input (random) vector is also proposed. 

A. Toy examples using synthetic data 

In this subsection, we present two toy examples that provide
insights into the way the proposed algorithm
works. As shown in Fig. 2, two artificial binary classification
problems are constructed. Negative samples are denoted
by red × marks, while positive ones by green crosses. We
assume that the uncertainty of each training example is given
via a covariance matrix. In Fig. 2a and 2c, the ellipses show
the iso-density loci of points described by the 0.03% of the
maximum density of each Gaussian distribution (please note
that these ellipses are only used for visualization purposes).
Moreover, in Fig. 2b and 2d, the covariance matrices are
approximated by low-rank matrices (rank one).

For each of the above experiments, a linear baseline SVM
(LSVM) is trained using solely the centres of the distributions;
i.e., ignoring the uncertainty of each sample. The resulting
separating lines are shown in Fig. 2 in dashed red. Moreover,
a linear SVM-GSU (LSVM-GSU) is also trained using the
centres of the above distributions, and the covariance matrices;
i.e., using the parameters of the Gaussian distribution followed
by each training example. LSVM-GSU is trained first in
the original feature space ($\mathbb{R}^2$), and then in linear subspaces
($\mathbb{R}$), preserving for each covariance matrix 90% of the total
variance. The resulting separating lines virtually coincide and
are shown in Fig. 2a and 2c (solid green lines). Finally, the
resulting separating lines of the SVM-GSUs trained in linear
subspaces using the low-rank (rank one) covariance matrices
and preserving 90% of the total variance are shown with
green lines in Fig. 2b and 2d. It is evident that, when the
uncertainty of the training data is taken into consideration,
the decision boundaries may change drastically. Moreover, the
proposed algorithm manages to learn approximately the same
(or a very similar) separating line, even in the cases where the
optimization problem is approximated in linear subspaces, or
the covariance matrices of the input vectors are low-rank.

B. Video Event Detection 

1) Dataset and experimental setup: For experiments on
video event detection, the large-scale video dataset of the 
TRECVID Multimedia Event Detection (MED) 2014 task [28] 
is used. The ground-truth annotated portion of it consists of 
three different video subsets: the “pre-specified” (PS) video 
subset (2000 videos, 80 hours, 20 event classes), the “ad-hoc” 
(AH) video subset (1000 videos, 40 hours, 10 event classes), 
and the “background” (BG) video subset (5000 videos, 200 
hours). Each video in the above dataset belongs to either 
one of 30 target event classes, or to the “rest of the world” 
(background) class. The above video dataset (PS+AH+BG)
is partitioned such that a training and an evaluation set are 
created, as follows 

• Training Set 

- 50 positive samples per event class, 

- 2496 background samples (negative for all event 
classes). 

• Evaluation Set

- ~50 positive samples per event class,

- 2496 background samples (negative for all event 
classes). 

A model vector representation scheme is adopted, similarly
to [29], for representing videos. That is, a set of 346 pre-existing
visual concept detectors (linear SVM classifiers that
are trained on the TRECVID Semantic Indexing (SIN) 2014
dataset [29], [28]) is used for deriving a 346-element descriptor
vector for each video (hereafter called “model vector”).
Specifically, each input video stream is initially sampled such 
that a keyframe is generated every 6 seconds. Next, each 
keyframe is processed as discussed above and a keyframe- 
level model vector is computed. Then, a video-level model 
vector for each video is computed by taking the average 
of the corresponding keyframe-level representations. Thus, 
the keyframe-level model vectors can be seen as different 
observations of the model vector which represents each video. 

2) Uncertainty modeling: Let us now define a set $\mathcal{X}$ of
$\ell$ annotated random vectors representing the aforementioned
video-level model vectors. Each random vector is distributed
normally; i.e., for the random vector representing the i-th
video, $\mathbf{X}_i$, we have $\mathbf{X}_i \sim \mathcal{N}(\mathbf{x}_i, \Sigma_i)$. That is,
$\mathcal{X} = \{(\mathbf{x}_i, \Sigma_i, y_i) : \mathbf{x}_i \in \mathbb{R}^n, \Sigma_i \in \mathbb{S}_{++}^n, y_i \in \{\pm 1\}, i = 1, \dots, \ell\}$.

Fig. 2: Toy experiments illustrating LSVM-GSU (green solid line) learning in the original feature space (a,c), and in linear
subspaces (b,d), in comparison with the baseline LSVM (red dashed lines). Circled points indicate support vectors as identified
by the standard LSVM.

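For each random vector $\mathbf{X}_i$, a number, $N_i$, of observations,
$\{\mathbf{x}_i^t \in \mathbb{R}^n : t = 1, \dots, N_i\}$ is available (these are the
keyframe-level model vectors that have been computed). Then,
the mean vector and the covariance matrix of $\mathbf{X}_i$ are computed
respectively as follows

$$\mathbf{x}_i = \frac{1}{N_i} \sum_{t=1}^{N_i} \mathbf{x}_i^t, \tag{23}$$

$$\Sigma_i = \sum_{t=1}^{N_i} (\mathbf{x}_i^t - \mathbf{x}_i)(\mathbf{x}_i^t - \mathbf{x}_i)^\top. \tag{24}$$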

However, the number of observations per video that are
available for our dataset is in most cases much lower than the
dimensionality of the input space; for instance, the average
number of observations available for each random vector
(video-level representation) is approximately 20 model vectors
(keyframe-level representations), while the dimensionality of
the input space is n = 346. Consequently, the covariance
matrices that arise using (24) are typically low-rank; i.e.
$\operatorname{rank}(\Sigma_i) \le N_i$. To overcome this issue, we assume that the
desired covariance matrices are diagonal. That is, we require
that the covariance matrix of the i-th training sample is given
by

$$\hat{\Sigma}_i = \operatorname{diag}(\hat{\sigma}_i^1, \dots, \hat{\sigma}_i^n),$$

such that the squared Frobenius norm of the difference $\Sigma_i - \hat{\Sigma}_i$
is minimized, i.e.,

$$\hat{\Sigma}_i = \operatorname*{argmin}_{\sigma^1, \dots, \sigma^n} \big\| \Sigma_i - \operatorname{diag}(\sigma^1, \dots, \sigma^n) \big\|_F^2.$$

It can easily be shown that the above criterion is fulfilled when
the estimator covariance matrix $\hat{\Sigma}_i$ is equal to the diagonal part
of the sample covariance matrix $\Sigma_i$, i.e.

$$\hat{\Sigma}_i = \operatorname{diag}(\sigma_i^1, \dots, \sigma_i^n),$$

where $\sigma_i^j$ denotes the j-th diagonal entry of $\Sigma_i$.


We note that, using this approximation approach, the covariance
matrices are diagonal but anisotropic and different for
each training input example. This is in contrast with other 
methods (e.g. [7], [17]) that assume more restrictive modeling 
approaches for the uncertainty; i.e., isotropic noise for each 
training sample. 
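A minimal Python sketch of this uncertainty model is given here: it computes the video-level mean and the diagonal covariance estimate from a list of keyframe-level observations. The function name and the `ddof` parameter, which controls the normalization of the variances, are assumptions of this sketch; the normalization actually used by the authors is not recoverable from this section.

```python
def mean_and_diagonal_covariance(observations, ddof=1):
    """Return (mean vector, per-dimension variances) for one video.

    observations is a list of N keyframe-level vectors of equal length n.
    Variances are divided by max(N - ddof, 1); this normalization is an
    assumption of the sketch.
    """
    count = len(observations)
    dim = len(observations[0])
    mean = [sum(obs[j] for obs in observations) / count for j in range(dim)]
    denom = max(count - ddof, 1)
    variances = [sum((obs[j] - mean[j]) ** 2 for obs in observations) / denom
                 for j in range(dim)]
    return mean, variances
```

Only n variances are stored per video instead of a full n × n matrix, and the variances generally differ across dimensions, which is what makes the modeled uncertainty anisotropic.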

3) Experimental results: Table I shows the performance 
of the proposed linear SVM-GSU (LSVM-GSU) in terms of 
average precision (AP) [30] for each target event in comparison
with the baseline linear SVM (LSVM), as well as with
a linear SVM extension which handles the input uncertainty 
isotropically (LSVM-isotropic) as in [7], [17]. Moreover, for 
each dataset, the mean average precision (MAP) across all 
target events is reported. The optimization of the C parameter
for both LSVM and LSVM-GSU is performed using a line 
search on a 3-fold cross-validation procedure, where at each 
fold the training set is split to 70% learning set and 30% 
validation set. 
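For reference, the evaluation measures can be sketched as follows. This is the common non-interpolated form of average precision computed over a ranked list; whether it matches the exact variant of [30] is an assumption, and the function names are illustrative.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision values at each positive item,
    with items ranked by decreasing score. labels are +1 or -1."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(ap_values):
    """MAP: arithmetic mean of the per-event AP values."""
    return sum(ap_values) / len(ap_values)
```

In this setting the scores would be the classifier outputs for the evaluation videos of one event, and MAP averages the resulting AP values over all target events.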

In Table I, column (a) shows the performance of the 
baseline LSVM when training is carried out using keyframe- 
level model vectors. That is, in this experimental scenario 
we attempt to resemble the case where a standard LSVM is 
trained using all the available observations of each training 
distribution, in contrast with the proposed LSVM-GSU, where 
training is carried out using solely the mean vectors and the 
covariance matrices. In column (b), we report the results of 
the standard LSVMs which were trained using the video-level
representations; that is, solely the mean vectors of
each distribution. In contrast, by modeling the uncertainty as 
described in the previous section, the proposed LSVM-GSU 
is validated both in the case that learning is carried out in the 
original feature space (column (h)), and in the cases that it 
is approximated in linear subspaces by preserving a certain 
fraction (p) of the total variance of each covariance matrix. 
Columns (d)-(g) show the performance of LSVM-GSU when 
p = 0.75,0.90,0.95, and 0.99, respectively. The performance 
of the SVM extension, described in [7], [17], where uncertainty
is modeled isotropically (LSVM-isotropic) is given in
column (c). The bold-faced numbers indicate the best result 
achieved for each event class. Finally, in column (i), the results 
of the McNemar [31], [32], [33] statistical significance test 
are reported. A * denotes statistically significant differences 
between the proposed LSVM-GSU (learning in original space) 
and baseline LSVM, while a ^ denotes statistically significant 
differences between LSVM-GSU and LSVM-isotropic. 
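As an illustration of how such a significance test operates, the sketch below computes the McNemar statistic with continuity correction from the per-sample decisions of two classifiers. The continuity-corrected chi-square variant, the function name, and the representation of decisions as boolean lists are assumptions of this sketch; the exact variant followed in [31], [32], [33] is not specified in this section.

```python
import math


def mcnemar_test(correct_a, correct_b):
    """McNemar test on paired decisions of classifiers A and B.

    correct_a[k] and correct_b[k] indicate whether each classifier was
    correct on test sample k. Returns (chi-square statistic, p-value).
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0                      # the classifiers never disagree
    chi2 = (abs(only_a - only_b) - 1) ** 2 / discordant
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Only the samples on which the two classifiers disagree contribute to the statistic; a small p-value suggests that the difference in their error rates is unlikely to be due to chance.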

From the obtained results, we observe that the proposed 
algorithm (learning in the original feature space) achieved 
better detection performance than both LSVM and LSVM- 
isotropic for 22 out of the 30 event classes. The relative 
boost between LSVM-GSU and LSVM, achieved for each 
event class, is shown in column (j) of Table I, while the 
overall best relative performance boost (in MAP) is equal 
to 9.83% and is achieved when LSVM-GSU is learned in 
the original feature space. However, it is worth noting that 
a considerable boost was also achieved when the LSVM-GSU 
is approximated in linear subspaces by preserving 99% of
the total variance for each covariance matrix. Furthermore, in 
general we observe that, as the fraction of the total variance 
preserved decreases, the overall detection performance also 
decreases. 


C. Hand-written digit classification 

1) Dataset and experimental setup: The proposed algorithm
is also validated on the problem of image classification
using the MNIST dataset of handwritten digits [34]. The 
MNIST dataset provides a training set of 60000 samples 
(approx. 6000 samples per digit), and a test set of 10000 
samples (approx. 1000 samples per digit). Each sample is 
represented by a 28 x 28 8-bit image. Originally, MNIST 
does not provide any information about the uncertainty of 


each image; some typical examples of the original training 
and testing set images are shown in Fig.3a. 

In order to make the dataset more challenging, as well as
to model a realistic distortion that may happen to this kind
of images (scanned handwritten digits), the original MNIST
dataset was “polluted” with noise. More specifically, each
image example was rotated by a random angle uniformly
drawn from the range $[-\theta, +\theta]$, where $\theta$ is measured in
degrees. Moreover, each image was translated by a random
vector $\mathbf{t}$ uniformly drawn from $[-t_p, +t_p]^2$, where $t_p$ is a
positive integer expressing distance that is measured in pixels.
We created five different noisy datasets by setting $\theta = 15^\circ$
and $t_p \in \{3, 5, 7, 9, 11\}$. The polluted datasets ($D_1$ to $D_5$,
respectively) are shown in Table II, where $D_0$ denotes the
original MNIST dataset. Fig. 3b and 3c show illustrative
examples of the noisy datasets $D_2$ ($\theta = 15^\circ$, $t_p = 5$) and
$D_5$ ($\theta = 15^\circ$, $t_p = 11$), respectively. Experiments with $\theta$
in the range $[5^\circ, 25^\circ]$ gave very similar results, thus we chose to
solely report the results that correspond to $\theta = 15^\circ$.
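The sampling of the distortion parameters described above can be sketched as follows. This is only an illustration under stated assumptions: the shift components are taken to be integers, the rotation itself is not applied (it would need an interpolating image routine), and the names `sample_distortion` and `translate` are hypothetical rather than taken from the authors' code.

```python
import random


def sample_distortion(theta_max=15.0, t_p=5, rng=random):
    """Draw a rotation angle in [-theta_max, theta_max] degrees and an
    integer shift (dx, dy) with each component in [-t_p, t_p] (assumption)."""
    angle = rng.uniform(-theta_max, theta_max)
    dx = rng.randint(-t_p, t_p)
    dy = rng.randint(-t_p, t_p)
    return angle, (dx, dy)


def translate(image, dx, dy, fill=0):
    """Shift a list-of-rows image by dx columns and dy rows, padding with fill."""
    rows, cols = len(image), len(image[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dy, c + dx
            if 0 <= rr < rows and 0 <= cc < cols:
                out[rr][cc] = image[r][c]
    return out


# Example: one draw for a dataset built with theta = 15 degrees and t_p = 5.
angle, (dx, dy) = sample_distortion(theta_max=15.0, t_p=5)
shifted = translate([[0, 9, 0], [0, 9, 0], [0, 9, 0]], dx=1, dy=0)
```

Pixels shifted outside the frame are discarded and the uncovered area is zero-filled; whether the original experiments handled borders this way is not stated in the text.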

TABLE II: MNIST “1” versus “7” datasets

Dataset    $\theta$      $t_p$ (pixels)
$D_0$      0°            0
$D_1$      15°           3
$D_2$      15°           5
$D_3$      15°           7
$D_4$      15°           9
$D_5$      15°           11


We create six different experimental scenarios using the
above datasets ($D_0$-$D_5$). First, we define the problem of
discriminating the number one (“1”) from the number seven
(“7”) similarly to [35]. Each class in the training procedure
consists of 25 samples, randomly chosen from the pool of
digits one (≈ 6K in total) and seven (≈ 6K in total), while the
evaluation of the trained classifier is carried out on the full
testing set (≈ 2K samples). In each experimental scenario we
report the average of 100 runs. Moreover, in each experimental
scenario we compare the proposed linear SVM-GSU (LSVM-GSU)
to the baseline linear SVM (LSVM), as well as to
LSVM-isotropic ([7], [17]). We report the average precision
(AP) [30] for each target class, and the mean average precision
(MAP) across 100 runs.

2) Uncertainty modeling: In Appendix C, we propose a
methodology that, given an image, models the distribution
of the image that results from a random translation of it. The
methodology is a first-order Taylor approximation, in a way
similar to the one used for optical flow. Then, we can show
that the image representation is distributed normally with a
certain mean vector and covariance matrix, which are also
evaluated. We use this methodology for modeling the
uncertainty of each training image in all the experiments
below. More specifically, we assume that the translation is
distributed normally as $\mathbf{t} \sim \mathcal{N}(\boldsymbol{\mu}_t, \Sigma_t)$, where

$$\boldsymbol{\mu}_t = (0, 0)^\top, \qquad \Sigma_t = \begin{pmatrix} \sigma_h^2 & 0 \\ 0 & \sigma_v^2 \end{pmatrix}.$$
TABLE I: Event detection performance (AP and MAP) of the linear SVM-GSU compared to the baseline linear SVM and a
LSVM extension for handling isotropic uncertainty (as in [7], [17]) using the MED14 dataset.

Columns: (a) LSVM, keyframe level (AP); (b) LSVM, video level (AP); (c) LSVM-isotropic (AP);
(d)-(g) LSVM-GSU, learning in linear subspaces (AP), for p = 0.75, 0.90, 0.95, 0.99;
(h)-(j) LSVM-GSU, learning in the original space: (h) AP, (i) McNemar tests, (j) Boost (%).

Event                                        (d)       (e)       (f)       (g)
Class   (a)      (b)      (c)      p=0.75    p=0.90    p=0.95    p=0.99    (h)      (i)   (j)
E021    0.1130   0.1862   0.2018   0.1156    0.1073    0.1200    0.1565    0.1994           7.09
E022    0.1244   0.1262   0.1492   0.0863    0.1107    0.0971    0.1610    0.1583          25.44
E023    0.2680   0.2593   0.2647   0.1570    0.2585    0.2432    0.2452    0.2733   *       5.40
E024    0.0467   0.0500   0.0540   0.0476    0.0537    0.0452    0.0492    0.0596          19.20
E025    0.0252   0.0169   0.0077   0.0184    0.0195    0.0173    0.0195    0.0077         -54.44
E026    0.0750   0.0700   0.0681   0.0733    0.0872    0.0851    0.0707    0.0810          15.71
E027    0.2502   0.2666   0.2504   0.1665    0.2344    0.2799    0.3105    0.2914           9.30
E028    0.1948   0.1829   0.1983   0.1693    0.2007    0.2091    0.2027    0.2064          12.85
E029    0.2458   0.2330   0.2433   0.2319    0.2299    0.2520    0.2250    0.2337           0.30
E030    0.1054   0.0601   0.1034   0.0755    0.0842    0.0914    0.1100    0.1179   *      96.17
E031    0.1781   0.1992   0.2133   0.1105    0.1603    0.2422    0.2291    0.2125           6.68
E032    0.0653   0.0521   0.0613   0.0484    0.0599    0.0673    0.0654    0.0638          22.46
E033    0.1019   0.0935   0.1335   0.1162    0.1497    0.1287    0.1363    0.1370          46.52
E034    0.0711   0.0658   0.0725   0.0692    0.0728    0.0719    0.0707    0.0726          10.33
E035    0.1996   0.2648   0.2794   0.1476    0.1651    0.1812    0.2207    0.2742           3.55
E036    0.1674   0.1957   0.2141   0.2191    0.2209    0.2281    0.2235    0.2436          24.48
E037    0.2227   0.3742   0.3728   0.3246    0.3606    0.3894    0.3913    0.3595          -3.93
E038    0.0567   0.0791   0.0360   0.0719    0.0692    0.0732    0.0680    0.0757          -4.30
E039    0.2189   0.2419   0.2397   0.1668    0.1645    0.1953    0.2210    0.2454           1.45
E040    0.0957   0.0829   0.1197   0.1251    0.1346    0.1484    0.1444    0.1281          54.52
E041    0.0656   0.0835   0.0890   0.0637    0.0653    0.0812    0.0839    0.0941          12.69
E042    0.0622   0.0580   0.0681   0.0757    0.0721    0.0726    0.0701    0.0753          29.83
E043    0.2212   0.2063   0.1996   0.1321    0.1650    0.2160    0.2055    0.1984          -3.83
E044    0.1631   0.2844   0.2999   0.1467    0.1828    0.2547    0.3008    0.3090           8.65
E045    0.1348   0.1773   0.1723   0.1249    0.1557    0.1948    0.1804    0.1853           4.51
E046    0.0750   0.0814   0.0862   0.0713    0.0626    0.0712    0.1032    0.1017          24.94
E047    0.1208   0.1275   0.1329   0.1213    0.1240    0.1304    0.1298    0.1316   *       3.22
E048    0.0476   0.0613   0.0772   0.0530    0.0941    0.0667    0.0570    0.0673           9.79
E049    0.0658   0.1067   0.1431   0.0458    0.0327    0.0494    0.1082    0.1184          10.97
E050    0.1849   0.2226   0.2256   0.1982    0.2522    0.2604    0.2447    0.2306           3.59
MAP     0.1322   0.1503   0.1592   0.1191    0.1383    0.1521    0.1601    0.1651           9.83

The variances of the horizontal and the vertical components
of the translation, namely $\sigma_h^2$ and $\sigma_v^2$, are set to

$$\sigma_h^2 = \sigma_v^2 = \left(\frac{p_t}{3}\right)^2,$$

where $p_t$ is measured in pixels. That is, the covariance
matrix is set such that the translation falls in the square
$[-p_t, p_t] \times [-p_t, p_t]$ with probability 99.7%. For the
experiments described below, this parameter is set to $p_t = 5$ pixels.
Using the above, the mean vector and covariance matrix of
the $i$-th image are given by (34) and (35), respectively, in
Appendix C.
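The 99.7% figure can be related to the familiar three-sigma rule with a few lines of arithmetic. The sketch below assumes the three-sigma reading, i.e. that the standard deviation of each translation component equals $p_t/3$ and that the two components are independent; both are assumptions made for illustration, and `prob_within` is a hypothetical helper name.

```python
import math


def prob_within(k: float) -> float:
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2.0))


p_t = 5.0                    # pixels, as in the experiments
sigma = p_t / 3.0            # assumed three-sigma choice for each component
per_axis = prob_within(p_t / sigma)
both_axes = per_axis ** 2    # assumes independent horizontal/vertical components
print(f"sigma = {sigma:.3f} px, per axis: {per_axis:.4f}, square: {both_axes:.4f}")
```

Under these assumptions each component stays within $[-p_t, p_t]$ with probability of about 0.9973, and both components do so jointly with probability of about 0.9946, so the quoted 99.7% is best read as a per-axis figure.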

3) Experimental results: Table III shows the performance
of the proposed classifier (LSVM-GSU) in terms of mean
average precision (MAP) for the problem of discriminating
digit “1” from digit “7”, for each dataset defined above ($D_0$-$D_5$).
We report the average of 100 runs of each experiment. The
proposed algorithm is compared both to the baseline linear
SVM (LSVM), where the uncertainty of each training sample
is not taken into account, and to a linear SVM extension
where the uncertainty is taken into consideration isotropically
(LSVM-isotropic) as in [7], [17]. The optimization of the C
parameter for both LSVM and LSVM-GSU is performed using
a line search on a 3-fold cross-validation procedure, where
at each fold the training set is split into a 70% learning set and
a 30% validation set. The performance of LSVM-GSU when the
training of each classifier is carried out in the original feature
space is shown in row 4, and in linear subspaces in row 5. In
row 5 we report both the classification performance and, in
parentheses, the fraction of variance that resulted in the best
classification result.
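To make the notion of a preserved variance fraction concrete, the sketch below picks, for one covariance matrix, the smallest number of leading eigen-directions whose eigenvalues account for at least a fraction $p$ of the total variance. The selection rule and the name `components_to_keep` are assumptions for illustration; the text only states that a fraction $p$ of the total variance of each covariance matrix is preserved.

```python
def components_to_keep(eigenvalues, p):
    """Smallest k such that the k largest eigenvalues sum to at least
    a fraction p of the total variance (assumed selection rule)."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running >= p * total:
            return k
    return len(vals)


# Toy spectrum of a 4 x 4 covariance matrix (made-up numbers).
spectrum = [5.0, 3.0, 1.0, 1.0]
for p in (0.75, 0.85, 0.99):
    print(p, components_to_keep(spectrum, p))
```

For the toy spectrum, lowering $p$ discards more directions, which is the trade-off explored along the horizontal axis of Fig. 4.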

Fig. 3: The MNIST dataset of handwritten digits. Illustrative examples of: (a) the original dataset, and the generated noisy
datasets (b) $D_2$ ($\theta = 15^\circ$, $t_p = 5$) and (c) $D_5$ ($\theta = 15^\circ$, $t_p = 11$).

The performance of the baseline linear SVM is shown
in the second row, and the performance of the linear SVM
extension handling the noise isotropically (as in [7], [17]) is
shown in the third row. Moreover, Fig. 4 shows the results
of the above experimental scenarios for datasets $D_0$-$D_5$. The
horizontal axis of each subfigure describes the fraction of the
total variance preserved for each covariance matrix ($p$), while
the vertical axis shows the respective performance of LSVM-GSU
with learning in linear subspaces (LSVM-GSU-SL$_p$).
Furthermore, in each subfigure, for $p = 1$ we also draw
the result of the proposed LSVM-GSU in the original feature
space (denoted with a rhombus), as well as the result of the
linear SVM extension that handles the uncertainty isotropically
(LSVM-isotropic) [7], [17] (denoted with a star). We
report the mean, and with an error bar show the variance of
the 100 iterations. The performance of the baseline LSVM
is shown with a solid line, while two dashed lines show
the corresponding variance of the 100 runs. From the obtained
results, we observe that the proposed LSVM-GSU with
learning in linear subspaces outperforms both the baseline
LSVM and LSVM-isotropic for all datasets $D_0$-$D_5$. Moreover,
LSVM-GSU achieves better classification results than
LSVM-isotropic in 5 out of 6 datasets, when learning is carried
out in the original feature space. Finally, all the reported
results are shown to be statistically significant using the t-test
[36]; significance values (p-values) were much lower than the
significance level of 1%, with most values being near 10“^. 

V. Conclusion 

In this paper we proposed a novel classifier that efficiently
exploits uncertainty in its input under the SVM paradigm. The
proposed SVM-GSU was validated on the large-scale dataset
of TRECVID MED 2014 for the problem of video event
detection, as well as on the MNIST dataset of handwritten
digits. For both of the above problems, a method for modeling
and estimating the uncertainty of each training example
was also proposed. As shown by the experiments on video
event detection and image classification, SVM-GSU efficiently
takes into consideration the uncertainty of the training
examples and achieves better detection or classification
performance than the standard SVM and previous SVM
extensions that model uncertainty isotropically.

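Appendix A

On Gaussian-like Integrals over Half-spaces

Theorem 1. Let $X \in \mathbb{R}^n$ be a random vector that follows
a multivariate Gaussian distribution with mean vector $\boldsymbol{\mu} \in \mathbb{R}^n$
and covariance matrix $\Sigma \in \mathbb{S}_{++}^n$, where $\mathbb{S}_{++}^n$ denotes the
space of $n \times n$ symmetric positive definite matrices with real
entries. The probability density function (pdf) of $X$ is given
by $f_X: \mathbb{R}^n \to \mathbb{R}$,

$$f_X(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right).$$

Moreover, let $\mathcal{H}$ be the hyperplane given by $\mathbf{a} \cdot \mathbf{x} + b = 0$.
$\mathcal{H}$ divides the Euclidean $n$-dimensional space into two half-spaces
(an open and a closed one), where the closed upper
half-space is given by

$$\Omega_+ = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a} \cdot \mathbf{x} + b \geq 0\}.$$

Then, the function $\mathcal{I}_+ : (\mathbb{R}^n \times \mathbb{R}) \times (\mathbb{R}^n \times \mathbb{S}_{++}^n) \to \mathbb{R}$ defined as

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \int_{\Omega_+} (\mathbf{a} \cdot \mathbf{x} + b) f_X(\mathbf{x}) \, \mathrm{d}\mathbf{x}, \tag{25}$$

is equal to

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{2} \left[1 + \operatorname{erf}\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)\right] + \frac{\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}}}{\sqrt{2\pi}} \exp\left(-\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)^2\right), \tag{26}$$

where $\operatorname{erf}: \mathbb{R} \to (-1, 1)$, $x \mapsto \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, \mathrm{d}t$, is the so-called
error function. Moreover, if the half-space is given as the lower
half-space $\Omega_- = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a} \cdot \mathbf{x} + b < 0\}$, then the function
$\mathcal{I}_- : (\mathbb{R}^n \times \mathbb{R}) \times (\mathbb{R}^n \times \mathbb{S}_{++}^n) \to \mathbb{R}$, given by

$$\mathcal{I}_-(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \int_{\Omega_-} (\mathbf{a} \cdot \mathbf{x} + b) f_X(\mathbf{x}) \, \mathrm{d}\mathbf{x}, \tag{27}$$

is equal to

$$\mathcal{I}_-(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{2} \left[1 - \operatorname{erf}\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)\right] - \frac{\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}}}{\sqrt{2\pi}} \exp\left(-\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)^2\right). \tag{28}$$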
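A numerical sanity check can make the closed forms (26) and (28) easier to trust. The sketch below is not part of the original derivation: it evaluates both expressions for an arbitrary two-dimensional example and compares them with a plain Monte Carlo estimate. The function names, the example values of $\mathbf{a}$, $b$, $\boldsymbol{\mu}$, $\Sigma$, and the hand-written 2 × 2 Cholesky factor are all assumptions chosen for illustration.

```python
import math
import random


def half_space_integrals(a, b, mu, Sigma):
    """Closed forms (26) and (28); returns the pair (I_plus, I_minus)."""
    n = len(a)
    m = sum(a[i] * mu[i] for i in range(n)) + b            # a . mu + b
    s2 = sum(a[i] * Sigma[i][j] * a[j]                     # a^T Sigma a
             for i in range(n) for j in range(n))
    erf_term = math.erf(m / math.sqrt(2.0 * s2))
    gauss_term = math.sqrt(s2 / (2.0 * math.pi)) * math.exp(-m * m / (2.0 * s2))
    i_plus = 0.5 * m * (1.0 + erf_term) + gauss_term
    i_minus = 0.5 * m * (1.0 - erf_term) - gauss_term
    return i_plus, i_minus


def monte_carlo_2d(a, b, mu, Sigma, n_samples=200_000, seed=0):
    """Estimate both integrals by sampling X ~ N(mu, Sigma) in two dimensions."""
    rng = random.Random(seed)
    l11 = math.sqrt(Sigma[0][0])                           # Cholesky factor of Sigma
    l21 = Sigma[1][0] / l11
    l22 = math.sqrt(Sigma[1][1] - l21 * l21)
    plus = minus = 0.0
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu[0] + l11 * z1
        x2 = mu[1] + l21 * z1 + l22 * z2
        value = a[0] * x1 + a[1] * x2 + b
        if value >= 0.0:
            plus += value                                  # contributes to (25)
        else:
            minus += value                                 # contributes to (27)
    return plus / n_samples, minus / n_samples


a, b = (1.0, -2.0), 0.5
mu, Sigma = (0.3, 0.1), ((2.0, 0.6), (0.6, 1.0))
print(half_space_integrals(a, b, mu, Sigma))
print(monte_carlo_2d(a, b, mu, Sigma))
```

The two printed pairs should agree up to Monte Carlo error. A further consistency check is that the two closed forms sum to $\mathbf{a} \cdot \boldsymbol{\mu} + b$, the mean of $\mathbf{a} \cdot X + b$.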


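Proof: We begin with the integral in (25). In our approach
we will need several coordinate transforms. First, we start with
a translation in order to get rid of the mean:

$$\mathbf{y} = \mathbf{x} - \boldsymbol{\mu} \iff \mathbf{x} = \mathbf{y} + \boldsymbol{\mu}.$$

Then

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \int_{\Omega_+^{1}} (\mathbf{a} \cdot \mathbf{y} + \mathbf{a} \cdot \boldsymbol{\mu} + b) \exp\left(-\frac{1}{2} \mathbf{y}^\top \Sigma^{-1} \mathbf{y}\right) \mathrm{d}\mathbf{y},$$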
TABLE III: MNIST “1” versus “7” experimental results (MAP). The proposed LSVM-GSU is compared to the baseline linear
SVM (LSVM), and a linear SVM extension which handles the uncertainty isotropically (LSVM-isotropic), as in [7], [17].

Dataset                                    $D_0$           $D_1$           $D_2$           $D_3$           $D_4$           $D_5$
LSVM                                       0.9952          0.9362          0.8240          0.6830          0.6558          0.6027
LSVM-isotropic                             0.9968          0.9327          0.8133          0.7222          0.6675          0.6328
LSVM-GSU, learning in original space       0.9971          0.9452          0.8310          0.7216          0.6708          0.6353
LSVM-GSU, learning in linear subspaces     0.9972 (0.99)   0.9480 (0.97)   0.8562 (0.89)   0.7543 (0.85)   0.6974 (0.95)   0.6640 (0.25)

[Figure 4: six panels, (a)-(f), titled Dataset $D_0$ to Dataset $D_5$. Horizontal axis: variance fraction preserved. Legend: LSVM, LSVM-GSU, LSVM-GSU-SL$_p$, LSVM-isotropic.]

Fig. 4: Comparisons between the proposed LSVM-GSU, the baseline LSVM, and the LSVM with isotropic noise in (a) the
original MNIST dataset ($D_0$), and (b)-(f) the noisy generated datasets $D_1$-$D_5$.


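where

$$\Omega_+^{1} = \{\mathbf{y} \in \mathbb{R}^n : \mathbf{a} \cdot \mathbf{y} + \mathbf{a} \cdot \boldsymbol{\mu} + b \geq 0\}.$$

Next, since $\Sigma \in \mathbb{S}_{++}^n$, there exist an orthonormal matrix
$U$ and a diagonal matrix $D$ with positive elements, i.e. the
eigenvalues of $\Sigma$, such that $\Sigma = U^\top D U$. Thus, it holds that
$\Sigma^{-1} = (U^\top D U)^{-1} = U^{-1} D^{-1} (U^\top)^{-1} = U^\top D^{-1} U$. Then,
by letting $\mathbf{z} = U\mathbf{y}$ and $\mathbf{a}_1 = U\mathbf{a}$, we have

$$\mathbf{a} \cdot \mathbf{y} = \mathbf{a}^\top \mathbf{y} = \mathbf{a}^\top (U^{-1} U) \mathbf{y} = \mathbf{a}^\top U^\top (U \mathbf{y}) = (U\mathbf{a})^\top \mathbf{z} = \mathbf{a}_1^\top \mathbf{z},$$

and

$$\mathbf{y}^\top \Sigma^{-1} \mathbf{y} = \mathbf{y}^\top (U^\top D U)^{-1} \mathbf{y} = (\mathbf{y}^\top U^\top) D^{-1} (U\mathbf{y}) = (U\mathbf{y})^\top D^{-1} (U\mathbf{y}) = \mathbf{z}^\top D^{-1} \mathbf{z}.$$

Then,

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \int_{\Omega_+^{2}} (\mathbf{a}_1 \cdot \mathbf{z} + \mathbf{a} \cdot \boldsymbol{\mu} + b) \exp\left(-\frac{1}{2} \mathbf{z}^\top D^{-1} \mathbf{z}\right) \mathrm{d}\mathbf{z},$$

where

$$\Omega_+^{2} = \{\mathbf{z} \in \mathbb{R}^n : \mathbf{a}_1 \cdot \mathbf{z} + \mathbf{a} \cdot \boldsymbol{\mu} + b \geq 0\},$$

since for the Jacobian $J = |U|$, it holds that $|J| = 1$.

Now, in order to do rescaling, we set $\mathbf{z} = D^{1/2} \mathbf{v}$ and $\mathbf{a}_2 = D^{1/2} \mathbf{a}_1$. Thus,

$$\mathbf{z}^\top D^{-1} \mathbf{z} = (D^{1/2} \mathbf{v})^\top D^{-1} (D^{1/2} \mathbf{v}) = \mathbf{v}^\top (D^{1/2} D^{-1} D^{1/2}) \mathbf{v} = \mathbf{v}^\top \mathbf{v}.$$

Moreover, $\mathbf{a}_1^\top \mathbf{z} = \mathbf{a}_1^\top (D^{1/2} \mathbf{v}) = (D^{1/2} \mathbf{a}_1)^\top \mathbf{v} = \mathbf{a}_2^\top \mathbf{v}$. Also,
it holds that $|D^{1/2}| = |\Sigma|^{1/2}$ and $\mathrm{d}\mathbf{z} = |D^{1/2}| \, \mathrm{d}\mathbf{v} = |\Sigma|^{1/2} \, \mathrm{d}\mathbf{v}$.
Consequently,

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}} \int_{\Omega_+^{3}} (\mathbf{a}_2 \cdot \mathbf{v} + \mathbf{a} \cdot \boldsymbol{\mu} + b) \exp\left(-\frac{1}{2} \mathbf{v}^\top \mathbf{v}\right) \mathrm{d}\mathbf{v},$$

where

$$\Omega_+^{3} = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{a}_2 \cdot \mathbf{v} + \mathbf{a} \cdot \boldsymbol{\mu} + b \geq 0\}.$$

Now, let $B$ be an orthogonal matrix such that $B\mathbf{a}_2 = \|\mathbf{a}_2\| \mathbf{e}_n$,
where $\mathbf{e}_n$ denotes the $n$-th standard basis vector,
which also means that $\mathbf{a}_2 = B^{-1} \|\mathbf{a}_2\| \mathbf{e}_n = B^\top \|\mathbf{a}_2\| \mathbf{e}_n$.
Moreover, let $\mathbf{m} = B\mathbf{v}$. Then,

$$\mathbf{a}_2 \cdot \mathbf{v} = \mathbf{a}_2^\top \mathbf{v} = (B^\top \|\mathbf{a}_2\| \mathbf{e}_n)^\top \mathbf{v} = \|\mathbf{a}_2\| \mathbf{e}_n^\top (B\mathbf{v}) = \|\mathbf{a}_2\| \mathbf{e}_n^\top \mathbf{m}.$$

Moreover,

$$\mathbf{v}^\top \mathbf{v} = \mathbf{v}^\top (B^\top B) \mathbf{v} = (B\mathbf{v})^\top (B\mathbf{v}) = \mathbf{m}^\top \mathbf{m}.$$

Consequently,

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}} \int_{\Omega_+^{4}} (\|\mathbf{a}_2\| \mathbf{e}_n^\top \mathbf{m} + \mathbf{a} \cdot \boldsymbol{\mu} + b) \exp\left(-\frac{1}{2} \mathbf{m}^\top \mathbf{m}\right) \mathrm{d}\mathbf{m} = \frac{1}{\sqrt{2\pi}} \int_c^{+\infty} (\|\mathbf{a}_2\| t + \mathbf{a} \cdot \boldsymbol{\mu} + b) \exp\left(-\frac{t^2}{2}\right) \mathrm{d}t,$$

where

$$\Omega_+^{4} = \{\mathbf{m} \in \mathbb{R}^n : \|\mathbf{a}_2\| \mathbf{e}_n^\top \mathbf{m} + \mathbf{a} \cdot \boldsymbol{\mu} + b \geq 0\} = \mathbb{R}^{n-1} \times [c, +\infty),$$

and $c = -\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\|\mathbf{a}_2\|}$. The second equality follows by integrating out the
first $n - 1$ coordinates of $\mathbf{m}$, each of which contributes a factor of one, and by writing $t$ for
the $n$-th coordinate. The norm of $\mathbf{a}_2$ can be expressed in terms
of $\mathbf{a}$, $\Sigma$ as follows

$$\|\mathbf{a}_2\|^2 = \mathbf{a}_2^\top \mathbf{a}_2 = (D^{1/2} U \mathbf{a})^\top (D^{1/2} U \mathbf{a}) = \mathbf{a}^\top U^\top D^{1/2} D^{1/2} U \mathbf{a} = \mathbf{a}^\top \Sigma \mathbf{a},$$

and thus

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{2\pi}} \int_c^{+\infty} \left(\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}} \, t + \mathbf{a} \cdot \boldsymbol{\mu} + b\right) \exp\left(-\frac{t^2}{2}\right) \mathrm{d}t, \tag{29}$$

where $c = -\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}}}$, and it is easily evaluated as follows

$$\mathcal{I}_+(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{2} \left[1 + \operatorname{erf}\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)\right] + \frac{\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}}}{\sqrt{2\pi}} \exp\left(-\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)^2\right).$$

Following similar arguments as above, for $\Omega_- = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{a} \cdot \mathbf{x} + b < 0\}$,
with $\Omega_-^{4} = \mathbb{R}^{n-1} \times (-\infty, c)$, we have

$$\mathcal{I}_-(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c} \left(\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}} \, t + \mathbf{a} \cdot \boldsymbol{\mu} + b\right) \exp\left(-\frac{t^2}{2}\right) \mathrm{d}t,$$

which leads to

$$\mathcal{I}_-(\mathbf{a}, b, \boldsymbol{\mu}, \Sigma) = \frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{2} \left[1 - \operatorname{erf}\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)\right] - \frac{\sqrt{\mathbf{a}^\top \Sigma \mathbf{a}}}{\sqrt{2\pi}} \exp\left(-\left(\frac{\mathbf{a} \cdot \boldsymbol{\mu} + b}{\sqrt{2 \mathbf{a}^\top \Sigma \mathbf{a}}}\right)^2\right). \qquad \blacksquare$$

Appendix B

On the convexity of the SVM-GSU loss function

Let $\mathcal{J}$ be the objective function of the optimization problem
(4), as shown in (7). We will show that $\mathcal{J}$ is convex with
respect to the optimization variables, $\mathbf{w}$ and $b$, over $\mathbb{R}^n \times \mathbb{R}$.
First, as every norm is convex, and every non-negative
weighted sum preserves the convexity, it suffices to show that
$\mathcal{L}$, as shown in (5), is convex with respect to $\mathbf{w}$, $b$ for all
$i = 1, \ldots, l$. We will prove an associated theorem first, which
we will use to prove the convexity of $\mathcal{L}$, $\forall i$.

Theorem 2. Let $f: \mathbb{R}^n \to \mathbb{R}_+$ be a non-negative, real-valued
function. Then, $g: \mathbb{R}^d \to \mathbb{R}$ given by

$$g(\boldsymbol{\theta}) = \int_{\mathbb{R}^n} \max\left(0, h(\boldsymbol{\theta}, \mathbf{x})\right) f(\mathbf{x}) \, \mathrm{d}\mathbf{x}, \tag{30}$$

is convex with respect to $\boldsymbol{\theta}$ over $\mathbb{R}^d$ if the function $h$ is convex
with respect to $\boldsymbol{\theta}$ over $\mathbb{R}^d$.

Proof: Let $\lambda \in [0, 1]$ and $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \mathbb{R}^d$. Then,

$$g(\lambda \boldsymbol{\theta}_1 + (1 - \lambda) \boldsymbol{\theta}_2) = \int_{\mathbb{R}^n} \max\left(0, h(\lambda \boldsymbol{\theta}_1 + (1 - \lambda) \boldsymbol{\theta}_2, \mathbf{x})\right) f(\mathbf{x}) \, \mathrm{d}\mathbf{x} \leq \int_{\mathbb{R}^n} \left[\max\left(0, \lambda h(\boldsymbol{\theta}_1, \mathbf{x})\right) + \max\left(0, (1 - \lambda) h(\boldsymbol{\theta}_2, \mathbf{x})\right)\right] f(\mathbf{x}) \, \mathrm{d}\mathbf{x},$$

since $h$ is convex, $f$ is non-negative, and for $p, q, r \in \mathbb{R}$ it holds that

$$p \leq q + r \implies \max(0, p) \leq \max(0, q) + \max(0, r).$$

Moreover, $\max(0, \lambda p) = \lambda \max(0, p)$, for $\lambda \geq 0$, $p \in \mathbb{R}$, and thus

$$g(\lambda \boldsymbol{\theta}_1 + (1 - \lambda) \boldsymbol{\theta}_2) \leq \lambda \int_{\mathbb{R}^n} \max\left(0, h(\boldsymbol{\theta}_1, \mathbf{x})\right) f(\mathbf{x}) \, \mathrm{d}\mathbf{x} + (1 - \lambda) \int_{\mathbb{R}^n} \max\left(0, h(\boldsymbol{\theta}_2, \mathbf{x})\right) f(\mathbf{x}) \, \mathrm{d}\mathbf{x} = \lambda g(\boldsymbol{\theta}_1) + (1 - \lambda) g(\boldsymbol{\theta}_2).$$

Consequently, $g$ is convex with respect to $\boldsymbol{\theta}$ over $\mathbb{R}^d$. $\blacksquare$

Applying the above theorem with $f(\mathbf{x}) = f_{X_i}(\mathbf{x})$,
which is a real-valued, non-negative function (as a
probability density function), and $h(\boldsymbol{\theta}, \mathbf{x}) = 1 - y_i (\mathbf{w} \cdot \mathbf{x} + b)$,
which is convex with respect to $\boldsymbol{\theta} = (\mathbf{w}^\top, b)^\top$ over $\mathbb{R}^d = \mathbb{R}^n \times \mathbb{R}$,
$\mathcal{L}$ is proven to be convex for all $i$. Consequently, the
objective function is convex. That means that every local
minimum of $\mathcal{J}$ is also a global one.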
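The convexity claim can also be probed numerically. The sketch below is an illustration rather than a proof: it evaluates the expected hinge loss of a single Gaussian training example, $\mathbb{E}[\max(0, 1 - y(\mathbf{w} \cdot X + b))]$, through the closed form of Theorem 1, and checks the convexity inequality on random pairs of parameter vectors. The example mean, covariance, label, and the function names are assumptions made for illustration.

```python
import math
import random


def expected_hinge(w, b, mu, Sigma, y):
    """E[max(0, 1 - y (w.X + b))] for X ~ N(mu, Sigma), via Theorem 1."""
    n = len(w)
    m = 1.0 - y * (sum(w[i] * mu[i] for i in range(n)) + b)   # mean of the margin term
    s2 = sum(w[i] * Sigma[i][j] * w[j]                        # its variance, w^T Sigma w
             for i in range(n) for j in range(n))
    if s2 <= 0.0:                                             # w = 0: deterministic value
        return max(0.0, m)
    return (0.5 * m * (1.0 + math.erf(m / math.sqrt(2.0 * s2)))
            + math.sqrt(s2 / (2.0 * math.pi)) * math.exp(-m * m / (2.0 * s2)))


def max_convexity_violation(trials=2000, seed=0):
    """Largest value of g(lam*t1 + (1-lam)*t2) - [lam*g(t1) + (1-lam)*g(t2)]."""
    rng = random.Random(seed)
    mu, Sigma, y = (0.5, -1.0), ((1.5, 0.4), (0.4, 0.8)), 1.0  # made-up example

    def g(theta):
        return expected_hinge(theta[:2], theta[2], mu, Sigma, y)

    worst = float("-inf")
    for _ in range(trials):
        t1 = [rng.uniform(-3.0, 3.0) for _ in range(3)]        # theta = (w1, w2, b)
        t2 = [rng.uniform(-3.0, 3.0) for _ in range(3)]
        lam = rng.random()
        mix = [lam * p + (1.0 - lam) * q for p, q in zip(t1, t2)]
        worst = max(worst, g(mix) - (lam * g(t1) + (1.0 - lam) * g(t2)))
    return worst


print(max_convexity_violation())
```

For a convex function the returned value cannot exceed zero beyond floating-point rounding, so a clearly positive result would indicate an error in the closed form or in its implementation.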

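Appendix C

Modeling the uncertainty of an image

Let $X \in \mathbb{R}^n$ be an $r \times r$ image, where $n = r^2$, given in
row-wise form as

$$X = \left(f_1(\mathbf{t}), \ldots, f_j(\mathbf{t}), \ldots, f_n(\mathbf{t})\right)^\top \in \mathbb{R}^n, \tag{31}$$

where $f_j : \mathbb{R}^2 \to \mathbb{R}$ denotes the intensity function of the $j$-th
pixel, after a translation by $\mathbf{t} = (h, v)^\top$. Fig. 5 illustrates this
case of study.

Fig. 5: Image translation by a random vector $\mathbf{t}$.

We will use Taylor’s theorem in order to approximate the
intensity function. The multivariate Taylor’s theorem [37] is given below
without proof.

Theorem 3 (Multivariate Taylor’s Theorem). Let $\mathbf{t} = (t_1, \ldots, t_n)^\top \in \mathbb{R}^n$
and consider a function $f: \mathbb{R}^n \to \mathbb{R}$. Let
$\mathbf{a} = (a_1, \ldots, a_n)^\top \in \mathbb{R}^n$ and suppose that $f$ is differentiable
(all first partial derivatives with respect to $t_1, \ldots, t_n$ exist) in
an open ball $B$ around $\mathbf{a}$. Then, the first-order case of Taylor’s
theorem states that:

If $f$ is differentiable on an open ball $B$ around $\mathbf{a}$
and $\mathbf{t} \in B$, then

$$f(\mathbf{t}) = f(\mathbf{a}) + \sum_{k=1}^{n} \frac{\partial f}{\partial t_k}(\mathbf{b}) (t_k - a_k) = f(\mathbf{a}) + \nabla f(\mathbf{b}) \cdot (\mathbf{t} - \mathbf{a}), \tag{32}$$

for some $\mathbf{b}$ on the line segment joining $\mathbf{a}$ and $\mathbf{t}$.

We will use the above theorem in order to approximate the
intensity function of the $j$-th pixel of the given image; i.e.,
function $f_j$. That is, around $\mathbf{a}$, the intensity is approximated
as follows

$$f_j(\mathbf{t}) \approx f_j(\mathbf{a}) + \nabla f_j(\mathbf{a}) \cdot (\mathbf{t} - \mathbf{a}),$$

by taking $\mathbf{b}$ to coincide with $\mathbf{a}$. Consequently, by setting
$\mathbf{a} = (0, 0)^\top = \mathbf{0}$, the above intensity function is approximated by

$$f_j(\mathbf{t}) \approx f_j(\mathbf{0}) + \nabla f_j(\mathbf{0}) \cdot \mathbf{t}.$$

Let us define $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^n$ given by

$$\mathbf{f}(\mathbf{t}) = \left(f_1(\mathbf{t}), \ldots, f_j(\mathbf{t}), \ldots, f_n(\mathbf{t})\right)^\top,$$

then, the image representation can be rewritten as

$$X = \mathbf{f}(\mathbf{0}) + \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix} \mathbf{t}. \tag{33}$$

Let us now assume that $\mathbf{t}$ is a random vector distributed
normally with mean $\boldsymbol{\mu}_t$ and covariance matrix $\Sigma_t$, i.e.
$\mathbf{t} \sim \mathcal{N}(\boldsymbol{\mu}_t, \Sigma_t)$. Then, $X$ is also distributed normally with mean
vector and covariance matrix that are given, respectively, by

$$\boldsymbol{\mu} = \mathbb{E}[X] = \mathbf{f}(\mathbf{0}) + \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix} \mathbb{E}[\mathbf{t}], \tag{34}$$

and

$$\Sigma = \mathbb{E}\left[(X - \boldsymbol{\mu})(X - \boldsymbol{\mu})^\top\right] = \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix} \mathbb{E}\left[(\mathbf{t} - \boldsymbol{\mu}_t)(\mathbf{t} - \boldsymbol{\mu}_t)^\top\right] \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix}^\top = \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix} \Sigma_t \begin{pmatrix} \nabla^\top f_1(\mathbf{0}) \\ \vdots \\ \nabla^\top f_j(\mathbf{0}) \\ \vdots \\ \nabla^\top f_n(\mathbf{0}) \end{pmatrix}^\top. \tag{35}$$

Thus, by setting $\mathbf{t} \sim \mathcal{N}(\boldsymbol{\mu}_t, \Sigma_t)$, it holds that $X \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$,
where the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\Sigma$ are
given by (34) and (35), respectively.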
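The construction in (33)-(35) can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the per-pixel gradients are approximated by finite differences of the image (central in the interior, one-sided at the borders), the sign convention of the gradient with respect to the translation is ignored (it does not affect the covariance, nor the mean when $\boldsymbol{\mu}_t = \mathbf{0}$), and the name `image_gaussian` is hypothetical.

```python
def image_gaussian(image, mu_t, Sigma_t):
    """Mean vector and covariance matrix of the row-wise flattened image under
    the first-order model X ~ f(0) + G t with t ~ N(mu_t, Sigma_t).

    image   : list of rows (at least 2 x 2), pixel intensities
    mu_t    : (h, v) mean translation
    Sigma_t : 2 x 2 covariance of the translation
    """
    rows, cols = len(image), len(image[0])
    f0, G = [], []                       # f(0) and the n x 2 matrix of gradients
    for r in range(rows):
        for c in range(cols):
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            g_h = (image[r][c1] - image[r][c0]) / (c1 - c0)   # horizontal derivative
            g_v = (image[r1][c] - image[r0][c]) / (r1 - r0)   # vertical derivative
            f0.append(float(image[r][c]))
            G.append((g_h, g_v))

    # Equation (34): mean = f(0) + G E[t].
    mean = [f0[j] + G[j][0] * mu_t[0] + G[j][1] * mu_t[1] for j in range(len(f0))]

    # Equation (35): covariance = G Sigma_t G^T, computed entry by entry.
    def quad(g, k):
        return sum(g[p] * Sigma_t[p][q] * k[q] for p in range(2) for q in range(2))

    cov = [[quad(gi, gj) for gj in G] for gi in G]
    return mean, cov


# Toy 3 x 3 horizontal ramp with the isotropic translation model p_t = 5 pixels,
# assuming a standard deviation of p_t / 3 per component.
sigma2 = (5.0 / 3.0) ** 2
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
mean, cov = image_gaussian(ramp, (0.0, 0.0), ((sigma2, 0.0), (0.0, sigma2)))
```

For the ramp every pixel has unit horizontal gradient and zero vertical gradient, so all entries of the resulting covariance matrix equal the horizontal variance. The covariance has rank at most two, which is in line with the paper's use of low-dimensional linear subspaces.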